![]() |
The Twofish Encryption Algorithm: A 128-Bit Block Cipher
by Bruce Schneier ; John Kelsey ; Doug Whiting ; David Wagner ; Chris Hall ; Niels Ferguson Wiley Computer Publishing, John Wiley & Sons, Inc. ISBN: 0471353817 Pub Date: 03/01/99 |
Previous | Table of Contents | Next |
For the ultimate in key agility, a full 256 Mbytes of precomputed tables could comprise all four S-boxes for the final two stages of q0, q1, covering all 216 key byte possibilities for the 128-bit key case, and including the MDS matrix multiply. With a good memory subsystem, such a version should cut another 1000 clocks or so out of the above key setup times. Clearly, this is a fairly expensive solution (at least with todays technology), but it illustrates the flexibility of Twofish very nicely.
Any performance measures that do not take key setup into account are only valid for asymptotically large amounts of text. For shorter messages, performance is the sum of key setup and encryption. For very short messages, the key setup time can overwhelm the encryption speed.
Table 5.3 gives Twofishs performance on the Pentium Pro (assembly-language version), both 128-bit key setup and encryption, for a variety of message lengths. This table assumes the best of our implementations (not including the large-memory implementations) for the particular length of text.
Hash functions built from block ciphers generally require the algorithm to be uniquely keyed for each text block [Pre93, Sch96]. Assuming the chaining variables are the plaintext and ciphertext, and the block to be hashed is the key, Twofish can hash text at a rate of 175 clock cycles per byte on a Pentium, assuming a 128-bit key, and 132 clock cycles per byte on a Pentium Pro/II.
Plaintext (bytes) | Keying Option | Clocks to Key | Clocks to Encrypt | Total Clocks per Byte |
---|---|---|---|---|
16 | Zero | 1250 | 860 | 131.9 |
32 | Zero | 1250 | 1720 | 92.8 |
64 | Zero | 1250 | 4690 | 73.3 |
128 | Minimal | 2400 | 6880 | 63.5 |
256 | Partial | 4900 | 7360 | 47.9 |
512 | Compiled | 8600 | 8256 | 32.9 |
1K | Compiled | 8600 | 16512 | 24.5 |
2K | Compiled | 8600 | 33024 | 20.3 |
4K | Compiled | 8600 | 66048 | 18.2 |
8K | Compiled | 8600 | 132096 | 17.2 |
16K | Compiled | 8600 | 264192 | 16.7 |
32K | Compiled | 8600 | 528384 | 16.4 |
64K | Compiled | 8600 | 1056768 | 16.3 |
1M | Compiled | 8600 | 270532608 | 16.1 |
Table 5.3. Best Speed to Encrypt a Message with a New 128-bit Key on a Pentium Pro |
As with most algorithms, the choices of language and compiler can have a huge impact on performance. It is clear that the Borland C 5.0 compiler chosen as the standard AES reference is not the best optimizing compiler. For example, the Microsoft Visual C++ 4.2 compiler generates Twofish code that is at least 20 percent faster than Borland on a Pentium computer, with both set to optimize for speed (630 clocks per block for Microsoft Visual C++ 4.2 versus 870 clocks per block for Borland C 5.0), on a Pentium Pro/II, the difference between the compilers is not quite as large (e.g., 600 clocks/block vs. 640 clocks/block), but it is still significant. Part of the difference stems from the inability of the Borland C compiler to generate intrinsic rotate instructions, despite documentation claiming that it is possible. This problem alone accounts for nearly half of the speed difference between Borland and Microsoft. The remaining speed difference comes simply from poorer code generation. The Borland C compiler is uniformly slower than Microsofts compiler. The encryption speed in Microsoft C of 40 Pentium clocks per byte (i.e., 630 clocks/block at 16 bytes/block) is over ten percent faster than the best known DES assembly language implementation on the same platform. However, coding the Twofish algorithm in assembly language achieves speeds of 258 clocks, achieving a very significant speedup over any of the C implementations.
To make matters even more complicated, the assembly language that optimizes performance on a Pentium (or Pentium MMX) is drastically different from the assembly language required to maximize speed on a Pentium Pro or Pentium II, even though the final code size and speed achieved on each platform are almost identical. For example, the Pentium Pro/II CPUs can perform only one memory read per clock cycle, while the Pentium and Pentium MMX can perform two. However, the Pentium Pro/II can perform two ALU operations per clock in addition to memory accesses, while the Pentium can process only a total of two ALU operations or memory accesses per clock. These (and other) significant architectural differences result in the fact that running the optimal Pentium Twofish encryption code on a Pentium Pro results in a slowdown of nearly 2:1and vice versa! Fortunately, it is relatively simple to detect the CPU type at run-time and select which version of the assembly code to use. Another anomaly is that the key schedule setup time is considerably faster (43 percent) on the Pentium MMX than on the Pentium, not because the key schedule uses any MMX instructions, but simply because of the larger cache size of the MMX chip.
Previous | Table of Contents | Next |